Extending*the Stata capacity to show numbers

Paris Rira

Join Date: Dec 2022

Posts: 385
#1

Extending*the Stata capacity to show numbers

20 Feb 2023, 14:40

Dear Profs and colleagues,

I am going to sum up all values in one variable,total_sale. I used this code to reach the accumulated amount.

Code:

egen sum_sale= total(SV606501)

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double total_sale float sum_sale 51892 628288651264 538892 628288651264 96788 628288651264 0 628288651264 0 628288651264 9112 628288651264 2380 628288651264 0 628288651264 31511 628288651264 152561 628288651264 0 628288651264 2829 628288651264 0 628288651264 81294 628288651264 0 628288651264 89601 628288651264 119800 628288651264 281376 628288651264 125084 628288651264

The variable is firms sales so the quantities are massive. The state says that the final accumulation is -81294 628288651264 - but I am sure that it's more than this amount, I guess after some limited number the state does not add more values. Am I right? If so, what's the solution to reach the real number?
Thank you for your cooperation.

Cheers,
Paris
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 36058
#2

20 Feb 2023, 15:12

Use a double variable to hold really big numbers. But if all you want is a single sum, summarize will do it:

Code:

. sysuse auto, clear (1978 automobile data) . su weight, meanonly . di r(sum) 223440

See the help for summarize which documents saved results. r(sum) is not shown by default but it is accessible once summarize has finished.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#3

20 Feb 2023, 15:13

The one thing I am sure of is that you must do this sum as -egen double sumsale = total(SV606501)-. If you do not specify -double-, then the resulting variable will not be able to represent all of the digits in the sum and will start truncating the result. Of this I am absolutely certain. Whether this, alone, will solve your problem I cannot say. This would not lead to the total looking as if it had just stopped accumulating; it would just mean that a few low order digits would be inaccurate. I should also point out that even with -double-, which is the largest storage type available in Stata, only 16 digits of precision are possible. If your exact grand total has more than 16 significant figures, then it is not possible to store it as a numeric variable in Stata without some loss of information.

Since you say that it looks like the total just stopped accumulating altogether, there is another possibility, that may arise if the _N in your data set is very large. It is possible that at some point the accumulated total becomes so large that adding in the next value of SV606501 has no effect because it is so many orders of magnitude smaller than the accumulated total that it becomes, literally, a "rounding error" when Stata tries to add it. A way to mitigate this is to add up the numbers starting from smallest to largest. That way the total accumulates more gradually, and by the time the total is really big, so are the numbers being added to it. So:

Code:

sort SV606501 gen sumsale = sum(SV606501) replace sumsale = sumsale[_N]

Added: Crossed with #2.

Last edited by Clyde Schechter; 20 Feb 2023, 15:15.
3 likes
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Feb 2023, 16:45

Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.

byte - 7 bits	-127	100
int - 15 bits	-32,767	32,740
long - 31 bits	-2,147,483,647	2,147,483,620
float - 24 bits	-16,777,216	16,777,216
double - 53 bits	-9,007,199,254,740,992	9,007,199,254,740,992

Comment

Paris Rira

Join Date: Dec 2022
Posts: 385

20 Feb 2023, 16:52

Thank you, prof Nick and prof Clyde,

I used the three mentioned approaches, The results are below.

Code:

 egen double sumsale = total( total_sale )

* Example generated by -dataex-. For more info, type help dataex
clear
input double(total_sale sumsale)
   3378 628288630485
  17750 628288630485
      0 628288630485
  95449 628288630485
 217467 628288630485
 110950 628288630485
 102081 628288630485
      0 628288630485
      0 628288630485
      0 628288630485
 110948 628288630485

Code:

su total_sale , meanonly

.di r(sum)
6.283e+11
=(628,300,000,000)

Code:

 sort total_sale

. gen sumsale = sum(total_sale)

. replace sumsale = sumsale[_N]


input double total_sale float sumsale
      0 628288651264
    147 628288651264
  92500 628288651264
      0 628288651264
  14951 628288651264
   2030 628288651264
      0 628288651264
1966782 628288651264
      0 628288651264
  30253 628288651264
      0 628288651264
 329115 628288651264
      0 628288651264
1253306 628288651264
 103356 628288651264
  17730 628288651264
 134827 628288651264
      0 628288651264
      0 628288651264
  53360 628288651264
 103359 628288651264
      0 628288651264
      0 628288651264
      0 628288651264
  48985 628288651264

Do you think which one shall I pick up?

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Feb 2023, 18:07

Well, it seems likely that the actual sum is a 12-digit number beginning with a 6.

Two things we can learn.

1) The result must be stored in a double or precision will be lost.

2) Once that is said, then sorting - as Clyde suggested - is not necessary, because all the intermediate values will be stored to full precision,

The following example leads to some conclusions.

Code:

sysuse auto, clear
keep price
summarize price, meanonly
display r(sum)

// we need bigger numbers to reach a 12-digit total
replace price = price*1000000

summarize price, meanonly
display r(sum)
display %20.0fc r(sum)

generate float f_sum = sum(price)
replace f_sum = f_sum[_N]
generate double e_sum = sum(price)
replace e_sum = e_sum[_N]
egen float f_total = total(price)
egen double d_total = total(price)

format %20.0fc *_*
list *_* in l, noobs clean

Code:

. sysuse auto, clear
(1978 automobile data)

. keep price

. summarize price, meanonly

. display r(sum)
456229

.
. // we need bigger numbers to reach a 12-digit total
. replace price = price*1000000
variable price was int now double
(74 real changes made)

.
. summarize price, meanonly

. display r(sum)
4.562e+11

. display %20.0fc r(sum)
     456,229,000,000

.
. generate float f_sum = sum(price)

. replace f_sum = f_sum[_N]
(73 real changes made)

. generate double e_sum = sum(price)

. replace e_sum = e_sum[_N]
(73 real changes made)

. egen float f_total = total(price)

. egen double e_total = total(price)

.
. format %20.0fc *_*

. list *_* in l, noobs clean

              f_sum             e_sum           f_total           e_total  
    456,228,995,072   456,229,000,000   456,228,995,072   456,229,000,000  

.

The first group highlighted in red tells us that the summarize command result r(total) yields a correct result when displayed with a suitable format.

Now, both generate ... sum() and egen ... total() do their calculations in double and yield the correct result when a double variable is created, but not when a float variable is created.

So the important lesson is that whichever of the three techniques you choose, if you are going to store the results in a variable, that variable - for this data - must be a double to hold the result in full precision.

Code:

summarize total_sale , meanonly
generate double sumsale = r(total)

generate double sumsale = sum(total_sale)
replace sumsale = sumsale[_N]

egen double sumsale = total(total_sale)

Announcement